Prediction of Property Sale Prices in New York City using scikit-learn
Introduction: To understand the impact of the selected variables on housing prices in New York City, we first develop a model trained on historical property sale data from 2020. We include eight independent variables drawn from three different datasets: total population, total housing units, people in the labor force, travel time, median household income, total households, offense description for the area, and building class category. To develop the model, we first split the 2020 sale data into train and test sets. We then use GridSearchCV to perform a k-fold cross validation that optimizes at least two hyperparameters of the RandomForestRegressor.
Authors:
Yuanhao Zhai, Kaye Li
Requirements met:
Data is collected through a means more sophisticated than downloading (e.g. scraping, API)
It combines data collected from 3 or more different sources.
The analysis of the data is reasonably complex, involving multiple steps (geospatial joins/operations, data shaping, data frame operations, etc).
You use osmnx or pandana to perform an analysis of street network data.
You use scikit-learn to perform a clustering analysis.
You perform a machine learning analysis with scikit-learn as part of the analysis.
The project includes multiple interactive visualizations that include a significant interactive component (cross-filtering, interactive widgets, etc)
Code
# Import the essential packages

# Base packages
import numpy as np
import pandas as pd
import geopandas as gpd

# Plotting packages
import seaborn as sns
from matplotlib import pyplot as plt
import holoviews as hv
import hvplot.pandas

# Sodapy API packages
import requests
from sodapy import Socrata

# Set a random seed
np.random.seed(42)
pd.options.display.max_columns = 999
Get the property sale price data from NYC Open Data using the Socrata API.
# Open a Socrata client for NYC Open Data
client = Socrata("data.cityofnewyork.us", None)

# Get all rows of the property sales dataset
property_sales = client.get_all("w2pb-icbu")

# Convert to pandas DataFrame
prosal_df = pd.DataFrame.from_records(property_sales)
# Clean the raw sale price data: drop unused columns and all NA rows
prosal_clean = prosal_df.drop(columns=['apartment_number', 'census_tract_2020', 'nta_code']).dropna()

# Convert the string columns to float64
prosal_clean['sale_price'] = pd.to_numeric(prosal_clean['sale_price'], errors='coerce')
prosal_clean['gross_square_feet'] = prosal_clean['gross_square_feet'].replace(',', '', regex=True)
prosal_clean['gross_square_feet'] = pd.to_numeric(prosal_clean['gross_square_feet'], errors='coerce')

# Drop the extreme records: keep only sale price > $10,000 and gross square feet > 30
prosal_filtered = prosal_clean[(prosal_clean['sale_price'] > 10000) & (prosal_clean['gross_square_feet'] > 30)]

# Convert the DataFrame to a GeoDataFrame with CRS EPSG:4326
prosal_clean_geo = gpd.GeoDataFrame(
    prosal_filtered,
    geometry=gpd.points_from_xy(prosal_filtered.longitude, prosal_filtered.latitude),
    crs="EPSG:4326",
)
To compare property prices fairly, we compute the sale price per square foot.
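The per-square-foot computation is not shown in the cells above; the sketch below illustrates how it could be derived. The toy DataFrame stands in for the cleaned sale data, and the column names follow the cleaning step above.

```python
import pandas as pd

# Toy stand-in for the cleaned sales data (prosal_filtered above)
sales = pd.DataFrame({
    "sale_price": [500000.0, 1200000.0],
    "gross_square_feet": [1000.0, 2000.0],
})

# Price per square foot: the "unit_sale_price" used in the plots below
sales["unit_sale_price"] = sales["sale_price"] / sales["gross_square_feet"]
print(sales["unit_sale_price"].tolist())
```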
# Draw a box plot to explore the data
plt.boxplot(prosal_last_geo['unit_sale_price'])
plt.show()
From the plot above, we can see that the typical price per square foot is around $400.
We can also draw a histogram to see the distribution of the data.
Code
plt.hist(prosal_last_geo['unit_sale_price'], bins=25, edgecolor='black')
plt.title('Property Price per Square Foot')
plt.xlabel('price per square foot')
plt.ylabel('count')
plt.show()
Most prices fall between $200 and $600 per square foot.
1.2 Get the Public Safety Data
To further explain what factors contribute to property prices, we collect public safety data from the NYPD arrest dataset, which records the category and exact geographic position of each arrest.
# Get the NYPD Arrest Data (Year to Date)
NYPD_arrest = client.get_all("uip8-fykc")

# Convert to pandas DataFrame
public_safe = pd.DataFrame.from_records(NYPD_arrest)
# Convert the DataFrame to a GeoDataFrame and drop the NA rows
safe_geo = gpd.GeoDataFrame(
    public_safe,
    geometry=gpd.points_from_xy(public_safe.longitude, public_safe.latitude),
    crs="EPSG:4326",
).drop(columns=['geocoded_column']).dropna()
safe_geo.head()
Besides public safety, we assume that population, household income, labor force, and travel time also affect sale prices, so we collect data related to those topics.
# Explore the census data available
import cenpy
available = cenpy.explorer.available()
# We use 2020 data from the American Community Survey 5-year estimates
acs = cenpy.remote.APIConnection("ACSDP5Y2020")

# Explore the data profile variables included in this database
acs.variables.head()
variables = [
    "NAME",
    "DP05_0001E",  # Total population
    "DP05_0086E",  # Total housing units
    "DP03_0002E",  # In labor force
    "DP03_0025E",  # Mean travel time to work (minutes)
    "DP03_0062E",  # Median household income (dollars)
    "DP02_0001E",  # Total households
]
Tips
New York City is composed of five boroughs, each coterminous with its own county: Manhattan is New York County, Brooklyn is Kings County, Queens is Queens County, the Bronx is Bronx County, and Staten Island is Richmond County.
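The borough-to-county mapping above can be captured as a small lookup table. This is a sketch; the name `BOROUGH_COUNTY_FIPS` is introduced here, and the codes are the state/county FIPS pairs used in the query cells below.

```python
# Borough -> (state FIPS, county FIPS); state 36 is New York
BOROUGH_COUNTY_FIPS = {
    "Manhattan": ("36", "061"),      # New York County
    "Brooklyn": ("36", "047"),       # Kings County
    "Queens": ("36", "081"),         # Queens County
    "Bronx": ("36", "005"),          # Bronx County
    "Staten Island": ("36", "085"),  # Richmond County
}
print(BOROUGH_COUNTY_FIPS["Manhattan"])
```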
# First, get the New York County (Manhattan) data
nyc_demo_data = acs.query(
    cols=variables,
    geo_unit="tract:*",
    geo_filter={"state": "36", "county": "061"},
)
# Then, get the data for the four other counties
counties = [("36", "005"), ("36", "047"), ("36", "081"), ("36", "085")]  # Bronx, Kings, Queens, Richmond
for state, county in counties:
    new_data = acs.query(
        cols=variables,
        geo_unit="tract:*",
        geo_filter={"state": state, "county": county},
    )
    nyc_demo_data = pd.concat([nyc_demo_data, new_data], ignore_index=True)
# Rename the columns to our topics
nyc_census = nyc_demo_data.rename(columns={
    "DP05_0001E": "Total_Population",
    "DP05_0086E": "Total Housing Units",
    "DP03_0002E": "In Labor Force",
    "DP03_0025E": "Travel_Time",
    "DP03_0062E": "Median_HH_Income",
    "DP02_0001E": "Total_HH",
})
nyc_census['GEOID'] = nyc_census['state'] + nyc_census['county'] + nyc_census['tract']
nyc_census.head()
|   | NAME | Total_Population | Total Housing Units | In Labor Force | Travel_Time | Median_HH_Income | Total_HH | state | county | tract | GEOID |
|---|------|------------------|---------------------|----------------|-------------|------------------|----------|-------|--------|-------|-------|
| 0 | Census Tract 165, New York County, New York | 6674 | 3841 | 3954 | 32.9 | 184691 | 3176 | 36 | 061 | 016500 | 36061016500 |
| 1 | Census Tract 166, New York County, New York | 6002 | 3279 | 2953 | 33.0 | 47778 | 2796 | 36 | 061 | 016600 | 36061016600 |
| 2 | Census Tract 167, New York County, New York | 6058 | 3804 | 3616 | 30.6 | 203711 | 2969 | 36 | 061 | 016700 | 36061016700 |
| 3 | Census Tract 168, New York County, New York | 5189 | 2102 | 1561 | 44.3 | 27222 | 1774 | 36 | 061 | 016800 | 36061016800 |
| 4 | Census Tract 169, New York County, New York | 8272 | 5016 | 5033 | 31.0 | 131097 | 3949 | 36 | 061 | 016900 | 36061016900 |
len(nyc_census) # We have 2327 Census Tracts in total.
2327
Code
# Get the census tract polygons so we can merge the census values with geometries
import pygris
nyc_tracts = pygris.tracts(state="36", county="061", year=2020)
for state, county in counties:
    new_tracts = pygris.tracts(state=state, county=county, year=2020)
    nyc_tracts = pd.concat([nyc_tracts, new_tracts], ignore_index=True)
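The cell that spatially joins the sale points to their census tracts does not survive in this export; the sketch below shows how such a join could look with toy geometries. The variable names and geometries here are illustrative only; the real data would be `prosal_clean_geo` and `nyc_tracts` from the cells above.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy tract polygons standing in for nyc_tracts
tracts = gpd.GeoDataFrame(
    {"GEOID": ["36061016500", "36061016600"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Toy sale point standing in for prosal_clean_geo
sales = gpd.GeoDataFrame(
    {"sale_price": [500000.0]},
    geometry=[Point(0.5, 0.5)],
    crs="EPSG:4326",
)

# Attach to each sale the tract it falls in; predicate= replaces the
# deprecated op= keyword flagged by the warnings
joined = gpd.sjoin(sales, tracts, predicate="within")
print(joined["GEOID"].tolist())
```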
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# The numerical and categorical feature columns (as listed in the train/test split below)
num_cols = ["Total_Population", "Total Housing Units", "In Labor Force",
            "Travel_Time", "Median_HH_Income", "Total_HH"]
cat_cols = ["ofns_desc", "building_class_category"]

# Set up the column transformer with two transformers:
# ----> scale the numerical columns
# ----> one-hot encode the categorical columns
transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor

# Initialize the pipeline
# NOTE: only use 20 estimators here so it runs in a reasonable time
pipe = make_pipeline(
    transformer,
    RandomForestRegressor(n_estimators=20, random_state=42),
)
3.2 Split the Data into 70% Train / 30% Test
from sklearn.model_selection import train_test_split

# Split the data 70/30
train_set, test_set = train_test_split(sales_census, test_size=0.3, random_state=42)

# The target labels: log of sale price per square foot
y_train = np.log(train_set["unit_sale_price"])
y_test = np.log(test_set["unit_sale_price"])

# The feature columns
feature_cols = [
    "Total_Population",
    "Total Housing Units",
    "In Labor Force",
    "Travel_Time",
    "Median_HH_Income",
    "Total_HH",
    "ofns_desc",
    "building_class_category",
]
x_train = train_set[feature_cols]
x_test = test_set[feature_cols]
# Fit on the training set
pipe.fit(train_set, y_train);
# What's the test score?
pipe.score(test_set, y_test)
0.4906609622139816
3.3 Use GridSearchCV to perform a k-fold cross validation that optimizes at least two hyperparameters of the RandomForestRegressor
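The grid-search cell itself does not survive in this export, so the following is a sketch of what a GridSearchCV over two RandomForestRegressor hyperparameters could look like. The synthetic data and the particular hyperparameter values are illustrative, not the project's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the transformed training data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# Two hyperparameters to tune, as the heading requires
param_grid = {
    "n_estimators": [10, 20],  # number of trees
    "max_depth": [5, None],    # tree depth limit
}

# k-fold cross validation (here k=3) over every parameter combination
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

When the estimator sits inside a `make_pipeline`, the grid keys would be prefixed with the step name, e.g. `randomforestregressor__n_estimators`.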
def evaluate_mape(model, X_test, y_test):
    """
    Given a model and test features/targets, print out the mean
    absolute error and accuracy.
    """
    # Make the predictions
    predictions = model.predict(X_test)

    # Absolute error
    errors = abs(predictions - y_test)
    avg_error = np.mean(errors)

    # Mean absolute percentage error
    mape = 100 * np.mean(errors / y_test)

    # Accuracy
    accuracy = 100 - mape

    print("Model Performance")
    print(f"Average Absolute Error: {avg_error:0.4f}")
    print(f"Accuracy = {accuracy:0.2f}%.")

    return accuracy
# Initialize the baseline pipeline
base_model = make_pipeline(
    transformer,
    RandomForestRegressor(n_estimators=20, random_state=42),
)

# Fit on the training set
base_model.fit(x_train, y_train)

# Evaluate on the test set
base_accuracy = evaluate_mape(base_model, x_test, y_test)
Model Performance
Average Absolute Error: 0.3403
Accuracy = 93.23%.
# Use the best estimator found by the grid search
best_model = grid_search.best_estimator_

# Fit on the training set
best_model.fit(x_train, y_train)

# Evaluate on the test set
best_accuracy = evaluate_mape(best_model, x_test, y_test)
Model Performance
Average Absolute Error: 0.3354
Accuracy = 93.31%.
Error Analysis:
We first test the model on the 30% test split of the 2020 data and receive a test score of 0.49, which is sufficient for us to proceed with this model. The baseline model achieves an accuracy of 93.23%, while the grid-search-tuned model achieves 93.31%. We then calculate the percent error for each census tract and create a choropleth map to inspect the spatial distribution of errors. The error level is consistently low across most areas, with only two or three outliers, which means our model is relatively spatially consistent.
3.4 Make a data frame with percent errors and census tract info for each sale in the test set
Create a data frame that has the property geometries, census tract data, and percent errors for all of the sales in the test set.
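The percent-error computation itself is not shown above; the sketch below illustrates one plausible form, using toy values. In the real notebook the predictions would come from `best_model.predict(x_test)` (exponentiated, since the target is log price), not the hard-coded array used here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the test set with its observed prices per square foot
test_set = pd.DataFrame({
    "GEOID": ["36061016500", "36061016600"],
    "unit_sale_price": [400.0, 500.0],
})

# Stand-in for np.exp(best_model.predict(x_test)) on the log-price model
predicted = np.array([440.0, 450.0])

# Absolute percent error per sale, the column mapped below
test_set["percent_error"] = (
    100 * np.abs(predicted - test_set["unit_sale_price"])
    / test_set["unit_sale_price"]
)
print(test_set["percent_error"].tolist())
```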
# Median percent error by census tract
median_errors = test_set.groupby('GEOID')['percent_error'].median().reset_index()
error_map = merged_ct.merge(median_errors, on='GEOID')

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
error_map.plot(
    column='percent_error',
    ax=ax,
    legend=True,
    legend_kwds={'label': "Median Percent Error by Census Tract"},
    cmap='OrRd',  # a sequential colormap suits error magnitudes
)
ax.set_axis_off()